Skip to content

Standardize parsing of symbol and string array literals#16748

Draft
straight-shoota wants to merge 6 commits intocrystal-lang:masterfrom
straight-shoota:refactor/array_literal_parsing
Draft

Standardize parsing of symbol and string array literals#16748
straight-shoota wants to merge 6 commits intocrystal-lang:masterfrom
straight-shoota:refactor/array_literal_parsing

Conversation

@straight-shoota
Copy link
Copy Markdown
Member

@straight-shoota straight-shoota commented Mar 17, 2026

Previously, string array literal parsing used a custom implementation, combined with a custom lexer mode (next_string_array_token).

This patch changes that to reuse the same parser and lexer features as other string literals. This standardization removes an extra concept and thus reduces code surface area. In return, we're adding some conditionals for this specific literal type into the standard implementations. This increases complexity a little bit, but it's still fairly easy to reason about and a common pattern in the parser.
Combining the parsing of all string-related literals clearly shows their differences (in the form of such conditionals) and ensures consistent behaviour.

In particular, this change establishes standard behaviour for escape characters, and thus fixes #12277
However, it's unclear whether we accept this behaviour change (see #12277 (comment)). If not, we need to implement a non-standard escape algorithm to ensure backwards-compatibility.

The main motivation for this refactor is that it opens the path for implementing string literals with interpolation (RFC 0021).

Resolves #12277

Copy link
Copy Markdown
Collaborator

@ysbaddaden ysbaddaden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A tiny downside is that the "found newline" information is lost outside of the method body. I'd be tempted to have if _found_newline = skip_space_in_percent_array_literal.

Co-authored-by: Julien Portalier <julien@portalier.com>
@straight-shoota straight-shoota added the breaking-change May have breaking effect in some edge cases. Mostly negligible, but still noteworthy. label Mar 19, 2026
@straight-shoota straight-shoota marked this pull request as ready for review March 19, 2026 19:56
@straight-shoota straight-shoota added this to the 1.20.0 milestone Mar 19, 2026
@straight-shoota straight-shoota marked this pull request as draft March 22, 2026 11:05
@straight-shoota straight-shoota removed this from the 1.20.0 milestone Mar 22, 2026
@straight-shoota
Copy link
Copy Markdown
Member Author

straight-shoota commented Mar 22, 2026

An ecosystem test showed that 9a5de72 introduces a regression. The lexer breaks in macro mode when a string literal contains what looks like the beginning of a macro var (and variations of that).

{% begin %}
%q(%t{)
{% end %}

This seems to be a regression from #16672 (without it, this patch should work fine).
But part of the problem is also that the lexer code for macro tokens is a convoluted mess of spaghetti. So it might be a good idea to straighten that out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

breaking-change May have breaking effect in some edge cases. Mostly negligible, but still noteworthy. kind:refactor topic:tools:formatter

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Backslashes in %w() and %i() misbehave

2 participants